Skip to content

fix: add retry logic for one time instruction #226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pratikjagrut
Copy link

@pratikjagrut pratikjagrut commented Mar 25, 2025

This PR addresses issue rancher/rancher#48916 by implementing retry logic for one-time instructions.

Changes

  • Added retry logic using retry.OnError for one-time instruction execution

Testing

Manually tested.
Logs:

[email protected]:/home/ubuntu# journalctl -fu rancher-system-agent.service
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Rancher System Agent version 15b7568 (15b7568) is starting"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Starting remote watch of plans"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Initial connection to Kubernetes cluster failed with error Get \"https://69c34c5e4eeb.ngrok.app/version\": tls: failed to verify certificate: x509: certificate signed by unknown authority, removing CA data and trying again"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Starting /v1, Kind=Secret controller"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Detected first start, force-applying one-time instruction set"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="[Applyinator] Applying one-time instructions for plan with checksum 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="[Applyinator] Extracting image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1 to directory /var/lib/rancher/agent/work/20250326-231340/0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749_0"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Pulling image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=warning msg="Failed to get image from endpoint: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout"
rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=error msg="error while staging: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"

rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=error msg="[Applyinator] Execution failed: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=info msg="[Applyinator] (attempt #1) Applying one-time instructions for plan with checksum 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=info msg="[Applyinator] (attempt #1) Retrying instruction 0 for plan 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:14:44Z" level=info msg="[Applyinator] Extracting image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1 to directory /var/lib/rancher/agent/work/20250326-231340/0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749_0"
rancher-system-agent[82604]: time="2025-03-26T23:14:44Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
rancher-system-agent[82604]: time="2025-03-26T23:14:44Z" level=info msg="Pulling image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=warning msg="Failed to get image from endpoint: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout"
rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=error msg="error while staging: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"

rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=error msg="[Applyinator] Execution failed: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=info msg="[Applyinator] (attempt #2) Applying one-time instructions for plan with checksum 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=info msg="[Applyinator] (attempt #2) Retrying instruction 0 for plan 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:15:20Z" level=info msg="[Applyinator] Extracting image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1 to directory /var/lib/rancher/agent/work/20250326-231340/0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749_0"
rancher-system-agent[82604]: time="2025-03-26T23:15:20Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
rancher-system-agent[82604]: time="2025-03-26T23:15:20Z" level=info msg="Pulling image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=warning msg="Failed to get image from endpoint: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout"
rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=error msg="error while staging: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"

rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=error msg="[Applyinator] Execution failed: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=info msg="[Applyinator] (attempt #3) Applying one-time instructions for plan with checksum 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=info msg="[Applyinator] (attempt #3) Retrying instruction 0 for plan 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:16:03Z" level=info msg="[Applyinator] Extracting image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1 to directory /var/lib/rancher/agent/work/20250326-231340/0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749_0"
rancher-system-agent[82604]: time="2025-03-26T23:16:03Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
rancher-system-agent[82604]: time="2025-03-26T23:16:03Z" level=info msg="Pulling image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:17:33Z" level=warning msg="Failed to get image from endpoint: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout"
rancher-system-agent[82604]: time="2025-03-26T23:17:33Z" level=error msg="error while staging: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"

@pratikjagrut pratikjagrut force-pushed the retry.onetime.instruction branch from 1d3de06 to 15b7568 Compare March 26, 2025 22:50
@pratikjagrut pratikjagrut changed the title [Draft]fix: add retry logic for one time instruction fix: add retry logic for one time instruction Mar 26, 2025
@pratikjagrut pratikjagrut marked this pull request as ready for review March 26, 2025 23:36
@pratikjagrut pratikjagrut force-pushed the retry.onetime.instruction branch from 15b7568 to 845d8b0 Compare March 27, 2025 20:03
Copy link
Contributor

@HarrisonWAffel HarrisonWAffel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the right approach. If an instruction fails I would still expect the failure count to be updated in the secret before another attempt is made.

Have we looked into potentially resetting the appliedChecksumKey field in the secret after the system agent restarts and reattempts the instruction? It may be that this needs to be the general behavior, as opposed to plan/Windows specific (which it currently is IIRC).

@pratikjagrut pratikjagrut force-pushed the retry.onetime.instruction branch from 845d8b0 to dcb804b Compare April 4, 2025 19:52
@pratikjagrut
Copy link
Author

@HarrisonWAffel, I made the changes per your suggestion. Can you please take a look?

Copy link
Collaborator

@jakefhyde jakefhyde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-functional change request: discussion on edge cases.

@@ -215,10 +215,10 @@ func (w *watcher) start(ctx context.Context, strictVerify bool) {
logrus.Infof("Detected first start, force-applying one-time instruction set")
needsApplied = true
hasRunOnce = true
secret.Data[appliedChecksumKey] = []byte("")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HarrisonWAffel I believe this will work, but I think we need to address ResetFailureCountOnStartup if we are changing the default behavior. Do you think we would be fine removing it from the plan and then removing it from Rancher? What about re-running failed plans at startup? Asking because these seem like mostly windows cases so I need a refresher on the context (I remember there is some interesting behavior regarding potentially cyclic/competing services).

There is also the fact that this introduces a change to the system-agent-upgrader plan which will re-run the latest plan during the upgrade, not a bug but definitely worth noting, since plans may be running during agent upgrades (all ready a possibility now, but much more likely something shows up in the UI).

Copy link
Contributor

@HarrisonWAffel HarrisonWAffel Apr 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In v2.10.0 the planner was updated to reattempt the Windows install plans multiple times before marking them as failed, as there can be transient issues that are not representative of a true plan failure. The problem I encountered was that, if a plan failed to apply 3 times but then succeeded, it would only be reattempted two times after the next reboot as the failure-count would still be equal to 3.

Handling this situation was complicated by an issue in rke2 which resulted in calico HNS namespaces being deleted each time rke2 was restarted (typically via the one-time-instruction). In that case, the plan should not be reattempted. If it was, the node might eventually be marked as available, even though some behavior (like deleting pods) would be completely broken. The solution was to introduce this field and conditionally set it based off of the clusters k8s version.

I think we still need to consider that situation. The rke2 fix was delivered in August of last year, so users have had plenty of time to upgrade, but removing this field and changing the default behavior could potentially silently break some existing clusters. I would be in support of doing that for 2.12, and communicating it in the release notes.

The existing change for applied-checksum shouldn't run into the above issue though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants